
FIGURE 4.14
The loss landscape illustration of the supernet. (a) The gradient of the current weights with different $\alpha$; (b) the vanilla $\alpha_{t+1}$ obtained with backpropagation; (c) $\tilde{\alpha}_{t+1}$ obtained with the decoupled optimization.

$$a^{(j)} = \sum_{i<j} \mathrm{softmax}\big(\alpha^{(i,j)}_{m}\big)\,\big(w^{(i,j)} \otimes a^{(i)}\big),$$

where $w^{(i,j)} = [[w_m]] \in \mathbb{R}^{M \times 1}$, $w_m \in \mathbb{R}^{C_{out} \times C_{in} \times K_m \times K_m}$ denotes the weights of all candidate operations between the $i$-th and $j$-th nodes, and $K_m$ denotes the kernel size of the $m$-th operation. Specifically, for the pooling and identity operations, $K_m$ equals the downsample size and the size of the feature map, and $w_m$ equals $1/(K_m \times K_m)$ and $1$, respectively. For each intermediate node, its output $a^{(j)}$ is jointly determined by $\alpha^{(i,j)}_m$ and $w^{(i,j)}_m$, while $a^{(i)}$ is independent of both $\alpha^{(i,j)}_m$ and $w^{(i,j)}_m$. As shown in Figs. 4.14(a) and (b), the gradient of the corresponding $w$ varies with different $\alpha$ and is sometimes difficult to optimize, possibly becoming trapped in local minima. However, by decoupling $\alpha$ and $w$, the supernet can escape such local minima and be optimized with better convergence.
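To make the mixed edge computation above concrete, the following is a minimal PyTorch-style sketch (our illustration under simplifying assumptions, not the implementation used here): every candidate operation on edge $(i, j)$ is treated as a convolution with kernel $w_m$ applied to $a^{(i)}$, and the results are mixed with $\mathrm{softmax}(\alpha^{(i,j)})$; all function and variable names are placeholders.

```python
import torch
import torch.nn.functional as F

def edge_output(a_i, alpha_ij, candidate_weights):
    """Mix the M candidate operations on edge (i, j).

    a_i:               input feature map a^{(i)}, shape (N, C_in, H, W)
    alpha_ij:          architecture parameters alpha^{(i,j)}, shape (M,)
    candidate_weights: list of M kernels w_m, each of shape (C_out, C_in, K_m, K_m)
    """
    probs = F.softmax(alpha_ij, dim=0)                       # softmax over the M candidates
    outs = [F.conv2d(a_i, w_m, padding=w_m.shape[-1] // 2)   # w_m convolved with a^{(i)}
            for w_m in candidate_weights]
    return sum(p * o for p, o in zip(probs, outs))           # softmax-weighted mixture

def node_output(preds, alphas, weights):
    """a^{(j)}: sum of the mixed edge outputs over all predecessor nodes i < j."""
    return sum(edge_output(a_i, alpha_ij, w_ij)
               for a_i, alpha_ij, w_ij in zip(preds, alphas, weights))
```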

Based on the derivation and analysis above, we propose our objective for optimizing the neural architecture search process:

$$\arg\min_{\alpha, w} L(w, \alpha) =
\begin{cases}
L_{\mathrm{NAS}} + \mathrm{reg}(w), & \text{for the Parent model,} \\
L_{\mathrm{DCP\text{-}NAS}} + \mathrm{reg}(w), & \text{for the Child model,}
\end{cases}
\qquad (4.42)$$

where $\alpha \in \mathbb{R}^{E \times M}$, $w \in \mathbb{R}^{M \times 1}$, and $\mathrm{reg}(\cdot)$ denotes the regularization term. Following [151, 265], the weights $w$ and the architecture parameters $\alpha$ are optimized sequentially, with $w$ and $\alpha$ updated independently. However, optimizing $w$ and $\alpha$ independently is improper because of their coupling relationship. We therefore regard the searching and training process of differentiable Child-Parent neural architecture search as a coupled optimization problem and solve it with a new backtracking method. Details are given in Section 4.4.6.
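As a point of reference for the discussion above, the following is a minimal sketch of the sequential scheme of [151, 265], in which $w$ and $\alpha$ are updated independently under the case-wise objective of Eq. (4.42); the names (`objective`, `loss_nas`, `loss_dcp_nas`, `reg`, the two optimizers) are placeholders of ours, not the authors' API.

```python
def objective(model, batch, is_parent, loss_nas, loss_dcp_nas, reg):
    """Eq. (4.42): L_NAS + reg(w) for the Parent model, L_DCP-NAS + reg(w) for the Child."""
    base = loss_nas(model, batch) if is_parent else loss_dcp_nas(model, batch)
    return base + reg(model)

def sequential_search_step(model, batch, is_parent,
                           w_optimizer, alpha_optimizer,
                           loss_nas, loss_dcp_nas, reg):
    """One step of the sequential scheme: w and alpha are updated independently."""
    # Step 1: update the weights w; only w is registered in w_optimizer,
    # so alpha is left unchanged by this step.
    w_optimizer.zero_grad()
    objective(model, batch, is_parent, loss_nas, loss_dcp_nas, reg).backward()
    w_optimizer.step()

    # Step 2: update the architecture parameters alpha; only alpha is
    # registered in alpha_optimizer, so w is left unchanged.
    alpha_optimizer.zero_grad()
    objective(model, batch, is_parent, loss_nas, loss_dcp_nas, reg).backward()
    alpha_optimizer.step()
```

The backtracking method introduced next replaces the independent update of $\alpha$ in Step 2 with the coupled update of Eq. (4.43).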

Decoupled Optimization for Child-Parent Model From a new perspective, we reconsider the coupling relation between $w$ and $\alpha$: the derivative calculation should take the coupling between $w$ and $\alpha$ into account. Based on the chain rule [187] and its notation, we have the following:

$$\begin{aligned}
\tilde{\alpha}_{t+1} &= \alpha_t + \eta_1\left(\frac{\partial L(\alpha_t, w_t)}{\partial \alpha_t} + \eta_2\,\mathrm{Tr}\!\left[\left(\frac{\partial L(\alpha_t, w_t)}{\partial w_t}\right)^{T}\frac{\partial w_t}{\partial \alpha_t}\right]\right) \\
&= \alpha_{t+1} + \eta_1\eta_2\,\mathrm{Tr}\!\left[\left(\frac{\partial L(\alpha_t, w_t)}{\partial w_t}\right)^{T}\frac{\partial w_t}{\partial \alpha_t}\right],
\end{aligned} \qquad (4.43)$$

where $\eta_1$ represents the learning rate, $\eta_2$ represents the backtracking coefficient, and $\tilde{\alpha}_{t+1}$ denotes the value obtained after backtracking from the vanilla $\alpha_{t+1}$. In contrast, the vanilla $\alpha_{t+1}$ is calculated from the backpropagation rule and the corresponding optimizer of the neural network.
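To illustrate how the coupling term $\mathrm{Tr}[(\partial L/\partial w_t)^{T}\,\partial w_t/\partial \alpha_t]$ in Eq. (4.43) can be evaluated with automatic differentiation, here is a minimal PyTorch sketch. It rests on our own simplifying assumption that the dependence of $w_t$ on $\alpha$ is modeled by a single unrolled weight-update step; the actual backtracking procedure is detailed in Section 4.4.6, and the names `loss_fn`, `lr_w`, `eta1`, and `eta2` are placeholders.

```python
import torch

def backtracked_alpha_update(alpha, w_prev, loss_fn, eta1, eta2, lr_w):
    """Compute alpha_tilde_{t+1} as in Eq. (4.43).

    alpha, w_prev: leaf tensors with requires_grad=True
    loss_fn(alpha, w): differentiable scalar loss L(alpha, w) that couples alpha and w
    """
    # Assumed coupling model: w_t(alpha) is one unrolled weight step taken under alpha.
    loss_prev = loss_fn(alpha, w_prev)
    (grad_w_prev,) = torch.autograd.grad(loss_prev, w_prev, create_graph=True)
    w_t = w_prev - lr_w * grad_w_prev        # w_t now depends on alpha through grad_w_prev

    # Partial derivative dL/dalpha with w_t treated as a constant.
    (grad_alpha,) = torch.autograd.grad(loss_fn(alpha, w_t.detach()), alpha)

    # Partial derivative dL/dw_t, then Tr[(dL/dw)^T dw/dalpha] as a vector-Jacobian product.
    (grad_w,) = torch.autograd.grad(loss_fn(alpha.detach(), w_t), w_t)
    (coupling,) = torch.autograd.grad(w_t, alpha, grad_outputs=grad_w)

    # Eq. (4.43): alpha_tilde_{t+1} = alpha_t + eta1 * (dL/dalpha + eta2 * Tr[...]).
    return alpha + eta1 * (grad_alpha + eta2 * coupling)
```

Because the trace is evaluated as a vector-Jacobian product, the full Jacobian $\partial w_t/\partial \alpha_t$ is never formed explicitly.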